by one of the two-letter header record type codes. A record type code may have two-letter
subtype codes. Table 2.1 lists and describes the two-letter codes of the SAM header section,
and Figure 2.14 shows an example header section. Notice that the SAM file begins with
“@HD VN:1.0 SO:coordinate”, which indicates that the file follows version 1.0 of the SAM
specification and that the alignments are sorted by coordinate. There are also several
@SQ header lines, one for each reference sequence used for the alignment. Each @SQ line
includes the sequence name (SN), which here is the chromosome name, and the sequence
length (LN). The last two lines of the header section are @PG, which describes the program
used for the alignment, and @CO, a comment line that here describes the command lines.
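The header layout described above can be sketched in a few lines of Python. This is only an illustration, assuming tab-separated header lines whose fields have the form TAG:VALUE; the function name `parse_header_line` and the example values (reference names, lengths, program name) are hypothetical and are not taken from Figure 2.14.

```python
def parse_header_line(line):
    """Split one SAM header line into (record_type, {tag: value})."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0]
    if record_type == "@CO":
        # Comment lines carry free text rather than TAG:VALUE pairs.
        return record_type, {"text": "\t".join(fields[1:])}
    tags = {}
    for field in fields[1:]:
        # Split at the first colon only, so values may contain colons.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return record_type, tags


# Hypothetical header lines, made up for illustration.
example_header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@PG\tID:aligner1\tPN:aligner",
    "@CO\texample comment text",
]

parsed = [parse_header_line(line) for line in example_header]
for record_type, tags in parsed:
    print(record_type, tags)
```

Keeping each line as a record type plus a dictionary of tags mirrors the two-level structure of the header: a record type code followed by its subtype codes.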
The alignment section begins after the header section. Each alignment line has 11
mandatory fields that store the essential alignment information. An alignment line may
also have a variable number of optional fields, which provide additional,
aligner-specific information.
Figure 2.15 shows a partial alignment section of a SAM file. The columns of the alignment
section are split because they do not fit the page. Table 2.2 lists and describes the 11
mandatory fields of the SAM alignment section.
These 11 mandatory fields are always present in every alignment line of a SAM file. If the
information for any of these fields is not available, the value of that field is set to
“0” if its data type is integer or to “*” if its data type is string. Most field names are
self-explanatory.
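As a rough sketch of how one alignment line could be read, the Python below splits a tab-separated line into the 11 mandatory fields and any optional fields. The field names follow the SAM specification; the function name `parse_alignment_line` and the example read are hypothetical. Optional fields are assumed to use the TAG:TYPE:VALUE form.

```python
MANDATORY_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
# Mandatory fields whose data type is integer; the rest are strings.
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment_line(line):
    """Return a dict of the 11 mandatory fields plus an 'optional' dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY_FIELDS):
        raise ValueError("an alignment line needs at least 11 fields")
    record = {}
    for name, value in zip(MANDATORY_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    # Anything after column 11 is optional, in TAG:TYPE:VALUE form.
    optional = {}
    for field in columns[len(MANDATORY_FIELDS):]:
        tag, value_type, value = field.split(":", 2)
        optional[tag] = (value_type, value)
    record["optional"] = optional
    return record


# Hypothetical unmapped read: unavailable string fields hold "*",
# unavailable integer fields hold 0.
example_line = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:group1"
record = parse_alignment_line(example_line)
print(record["RNAME"], record["POS"], record["optional"])
```

Converting the integer fields on read makes the “0” and “*” placeholders easy to tell apart from real values in later processing.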
TABLE 2.1 The Two-Letter Codes of the Header Section and Their Descriptions

| Code | Description |
|------|-------------|
| @HD | File-level metadata. If present, it must be the first line of the SAM file. This header line may include the subtypes VN for format version, SO for sorting order, GO for grouping of alignments, and SS for sub-sorting order of alignments. |
| @SQ | The reference sequence used for aligning the reads. A SAM file may include multiple @SQ lines, one for each reference sequence used. The order of the sequences defines the order of alignment sorting. The two most common subtype codes in this header line are SN for reference sequence name and LN for reference sequence length. |
| @RG | Identifies a read group; it is used by some downstream analysis programs (e.g., GATK) for grouping files based on the study design. Multiple @RG lines can exist in a SAM file. This line may include ID for the unique read group identifier, BC for the barcode sequence identifying the sample, CN for the name of the sequencing facility, DS for a description of the read group, DT for the date of sequencing, LB for the sequencing library, PG for the programs used for processing the read group, PL for the platform of the sequencing technology used to generate the reads, PM for the platform model, PU for the platform unit, which is a unique identifier (e.g., flow cell/slide barcode), and SM for the sample identifier, which is the pool name where a pool is being sequenced. |
| @PG | Describes the program used to align the reads. It may include ID for the unique program record identifier, PN for the program name, CL for the command line used to run the program, PP for the ID of the previous @PG record, DS for a description, and VN for the program version. |
| @CO | A text comment. Multiple @CO lines are allowed. |
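Two of the rules in Table 2.1 lend themselves to a small check: @HD, if present, must be the first line, and the order of the @SQ lines defines the reference order. The sketch below is one possible way to apply them, assuming tab-separated TAG:VALUE fields; the function name `reference_order` and the example header values are hypothetical.

```python
def reference_order(header_lines):
    """Return [(name, length), ...] from the @SQ lines, in file order."""
    references = []
    for index, line in enumerate(header_lines):
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@HD" and index != 0:
            raise ValueError("@HD must be the first line when present")
        if fields[0] == "@SQ":
            # Build {tag: value} from the TAG:VALUE fields of this line.
            tags = dict(field.split(":", 1) for field in fields[1:])
            references.append((tags["SN"], int(tags["LN"])))
    return references


# Hypothetical header with two made-up reference sequences.
sq_example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:800",
    "@CO\texample comment",
]
print(reference_order(sq_example))
```

A list of pairs is used instead of a plain mapping to make it explicit that the order of the @SQ lines carries meaning.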